Learning diary


Chapter 1.

This is the first assignment on the University of Helsinki MOOC.

My git repository is at the following address: https://github.com/toparvia/IODS-project.git

This is the first entry in the learning diary. The first lecture introduced basic methods of collaboration, such as using Git for your code and writing reports with Markdown and knitr. I have used these before, but I'm more familiar with them on UNIX than on Windows machines, so I felt quite clumsy at first; in my own PhD project I don't need to collaborate on the methods side as much. I chose GitHub Desktop and RStudio, as recommended in the course materials.

I also went through the R Short and Sweet exercises. The only trouble was with the last assignment: my function was structured in a different order than expected, so it wasn't accepted at first, but I finally arrived at a solution that satisfied the test. I'm curious to see how the course develops.

Below are some tests of R Markdown features I tried:

inline equation: \(A = \pi*r^{2}\)

  1. ordered list
  2. item 2
     • sub-item 1
     • sub-item 2

| Table Header | Second Header |
| ------------ | ------------- |
| Table Cell   | Cell 2        |
| Cell 3       | Cell 4        |

Chapter 2.

I moved on to the next set of DataCamp exercises, Regression and model validation. The exercises were quite interesting and showed me some things I didn't know before.
library(GGally) # ggpairs comes from GGally, not ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

# Data exploration of the learning2014 data with ggpairs

I found this code quite useful for getting an overview of the data. I have previously used simpler plotting tools for initial analysis, such as separate correlation plots or histograms, but I might use this one again since it provides the same insight in a single figure.

I looked into the data provided by the exercise. The data has a large number of different parameters, point estimates and some categorical variables. I used the common commands for a first look.

require(readr)
PKY <- read_delim("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only

dim(PKY) #dimensions of the data
str(PKY) #structure of the data
summary(PKY) #summary of the data

I also used some other methods for data exploration, but I'll write about those at the end of this exercise, together with the data analysis.

library(data.table)
#Here we make each of the needed variables for practice
gender      <- data.table(PKY$gender)
age         <- data.table(PKY$Age)
attitude    <- data.table(PKY)[, .(Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)]
attitude    <- attitude/10
deep        <- data.table(PKY)[, .(D03+D11+D19+D27+D07+D14+D22+D30+D06+D15+D23+D31)]
deep        <- deep/12
surf        <- data.table(PKY)[, .(SU02+SU10+SU18+SU26+SU05+SU13+SU21+SU29+SU08+SU16+SU24+SU32)]
surf        <- surf/12
stra        <- data.table(PKY)[, .(ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28)]
stra        <- stra/8
points      <- data.table(PKY$Points)
#Combining variables to one data.table
learn2014   <- data.table(cbind(gender, age, attitude, deep, stra, surf, points))
#giving column names for the data.table
colnames(learn2014) <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
#removing the values with the points of zero
learn2014   <- learn2014[points >0]
#writing the file to the data folder
write.csv(learn2014, file = "data/learn2014.csv")
#reading the table
learn2014_reloaded <- read.csv(file = "data/learn2014.csv")
#showing the data summary with reloaded data
summary(learn2014_reloaded)
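As an aside, the step-by-step wrangling above could be condensed into a single data.table chain. A minimal sketch of the pattern on a toy table (the column names here are made up, not the real PKY ones):

```r
library(data.table)
# Toy stand-in for PKY with hypothetical columns
dt <- data.table(gender = c("F", "M", "F"), Age = c(21, 25, 30),
                 Da = c(3, 4, 5), Db = c(2, 5, 1), Points = c(20, 0, 25))
# select, rename, transform and filter in one chain
learn <- dt[, .(gender, age = Age, attitude = (Da + Db) / 2, points = Points)][points > 0]
learn
```

The same chaining with the real sum variables would replace the separate data.table() calls, cbind and colnames steps in one go.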

I produced the required data with the data.table package, which I had used only briefly before; this seemed like a good place to practice. I could probably have written this as one command, but I felt more confident doing it one step at a time. Next we start the analysis of the data.

Data analysis

learn2014 <- read.csv(file = "data/learn2014.csv")
require(GGally)
require(ggplot2)

#This function draws, for each parameter pair, a linear regression line and a loess (locally weighted scatterplot smoothing) curve, so non-linear relationships stand out; it plugs into the lower panels of ggpairs (GGally/ggplot2).
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method=loess, fill="red", color="red", ...) +
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}
g = ggpairs(learn2014,columns = c(2:7), lower = list(continuous = my_fn))
g

# Here we can see that the combined parameters don't correlate strongly with each other, and against points all parameters except age and surf show a positive linear relationship. Gender is excluded from this picture.
#The distributions look at least roughly normal. Age is not normally distributed; perhaps it follows a Poisson distribution.
p <-ggpairs(learn2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

#Gender seems to affect attitude the most; however, these differences look rather small.

#I chose the model parameters based on the data exploration I did before.
#Attitude looked promising so it was in all the models.
m1 <- lm(points ~ attitude*deep*stra*surf+age, data=learn2014)
m2 <- lm(points ~ attitude + stra + surf + age + attitude:stra + 
           attitude:surf + stra:surf + attitude:stra:surf, data = learn2014)
m3 <- lm(points ~ attitude+deep+stra+surf+age, data=learn2014)
m4 <- lm(points ~ attitude + stra + age, data = learn2014)
m5 <- lm(points ~ attitude + stra, data = learn2014)
m6 <- lm(points ~ attitude + stra + gender, data = learn2014)
AIC(m1,m2,m3,m4,m5,m6)
##    df      AIC
## m1 18 1039.982
## m2 10 1031.755
## m3  7 1029.450
## m4  5 1028.225
## m5  4 1029.038
## m6  5 1030.978
#By Akaike's information criterion, m4 has the lowest AIC, but m5 is within one AIC unit of it and is the most parsimonious, so I continue with m5.
par(mfrow=c(2,2))
plot(m5)

#The model looks good. The residuals are distributed quite evenly on both sides of zero and there are only a few outliers. The included parameters probably fulfil the normality assumption and make a good predictor combination, since their mutual correlation was low (<0.06).
drop1(m5)
## Single term deletions
## 
## Model:
## points ~ attitude + stra
##          Df Sum of Sq    RSS    AIC
## <none>                4559.4 555.95
## attitude  1   1051.89 5611.3 588.41
## stra      1     81.74 4641.1 556.90
#drop1() tests whether dropping any single term would improve the model.
#Looks like we already have the most parsimonious model with the lowest AIC.
summary(m5)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# Attitude (global attitude toward statistics) is highly significant and stra (organised studying and time management) is a borderline case with p ≈ 0.09. The model explains only about 20 % of the variance in the data. Perhaps further data wrangling, or better models, might be in order.

Extra

I did some analysis to find the predictors in the full data, just out of interest. I found a different set of predictors that correlate with the points.

library(caret) #findCorrelation algorithm
library(ggcorrplot) #plot the correlations
#PKY is the full dataset
PKY2 <- PKY[-60] # removing the gender variable (non-numeric)
correlations <- abs(cor(PKY2, use="pairwise.complete.obs"))
#absolute correlations of all parameters
correlations.selected <- findCorrelation(correlations, cutoff = .7, exact = TRUE)
#columns selected by the 0.7 cutoff
corr <- cor(PKY2[c(correlations.selected,59)])
#correlations of narrow selection and points variable

# Plotting correlations
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of PKY", 
           ggtheme=theme_bw)

What does this do? It takes all the predictors, finds every pairwise correlation above a cutoff (so you can see which questions correlate with each other), and then compares that narrowed set against the value of interest, in this case the points.
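To check my understanding of findCorrelation, here is a minimal sketch on synthetic data (toy variables, not the PKY questions): two near-duplicate predictors and one independent one, where the function flags one of the duplicates for removal.

```r
library(caret)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly a duplicate of x1
x3 <- rnorm(100)                 # independent of the others
cm <- abs(cor(data.frame(x1, x2, x3)))
drop.idx <- findCorrelation(cm, cutoff = .7, exact = TRUE)
drop.idx  # index of the column flagged as redundant (one of x1/x2)
```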

One other thing I found handy: since I've worked on this data set with two different computers, I can easily migrate between them and just use Git to move the data. I also found the following function handy, so that I can use the same script on both computers.

#working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #works in Rstudio only

Basically it sets the working directory to the location of the source file you opened, without changing RStudio's settings.


Chapter 3. Logistic regression

This time I had much less time to spend on this exercise, so I mostly used the methods given.

Data Set Information:

These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

# To empty the memory after the previous exercise
rm(list=ls())
# Load data
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

alc <- read.csv(file = "data/ex3alc.csv")

library(dplyr) # for glimpse()
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
#summary(alc)

# The data set is combined from two different files from the UCI Machine Learning Repository


#alc is the full dataset; we take the numeric variables for correlation analysis
library(caret)      # findCorrelation
library(ggcorrplot) # correlation plot
alc2 <- select_if(alc, is.numeric)
correlations <- abs(cor(alc2, use="pairwise.complete.obs"))
#absolute correlations of all parameters
correlations.selected <- findCorrelation(correlations, cutoff = .3, exact = TRUE)
#columns selected by the 0.3 cutoff
#Making sure that grades are part of the correlation table
grade.columns <- 15:17
correlations.selected <- union(correlations.selected,grade.columns )
# calculating the variables
corr <- cor(alc2[c(correlations.selected)])
#correlations of narrow selection and grades variables G1, G2, G3

# Plotting correlations
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of alc", 
           ggtheme=theme_bw)

# Now we see that the highest correlations with G1:G3 are with age, Medu, alc_use and Walc. Since alc_use and Walc correlate highly with each other, only the better one is selected for the analysis. G3 is the final grade, so we first try to predict only that and drop G1 and G2, even though they would likely be good predictors of G3.

# So we want to predict the final grade of the students with their alcohol use, age and their mothers' education level.

# I hypothesise that alcohol use reduces performance and that the mothers' educational level increases performance.

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method=loess, fill="red", color="red", ...) +
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}
g = ggpairs(alc2,columns = c(correlations.selected), lower = list(continuous = my_fn))
g

# Let's add some factor variables to the data
names.selected <- colnames(alc2[correlations.selected])
alc3 <- subset(alc, select=c(names.selected, "sex", "school", "guardian", "activities"))

p <-ggpairs(alc3, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

# The categorical variables are quite similar and do not give much indication, other than that high alcohol use is more common among males and little to no alcohol use is more common among females.

m <- glm(G3 ~ alc_use + Walc + age + Medu, family = "gaussian", data = alc3)
m2 <- glm(G3 ~ alc_use + age + Medu, family = "gaussian", data = alc3)
deviance(m2)/m2$null.deviance
## [1] 0.9052058
AIC(m,m2)
##    df      AIC
## m   6 1969.777
## m2  5 1969.543
par(mfrow=c(2,2))
plot(m2)

# The Akaike information criterion gives a similar result, so, as hypothesised, Walc is dropped since it covaries with alc_use; that way we get the most parsimonious model.

# The residual deviance is about 91 % of the null deviance, so the model explains only about 9 % of the variation without the help of G1 and G2.
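For a gaussian glm, 1 - deviance/null.deviance reproduces the ordinary R², which makes the deviance ratio easy to sanity-check; a quick demonstration on the built-in mtcars data (not the course data):

```r
# Deviance-based R^2 for a gaussian glm equals the usual R^2 of the
# corresponding linear model
m.demo <- glm(mpg ~ wt, family = gaussian, data = mtcars)
r2 <- 1 - deviance(m.demo) / m.demo$null.deviance
r2  # equals summary(lm(mpg ~ wt, data = mtcars))$r.squared
```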

# !!! Oh, I only realised at this point that the target variable was supposed to be high/low alcohol consumption; now I see why logistic regression is used in this case... !!!

# I suppose I need to do another round of parameter selection for this case.

# Feature selection based on LASSO (least absolute shrinkage and selection operator), using centering and scaling as preprocessing and then 10-fold cross-validation
fit <- sbf(
  form = alc$alc_use ~ .,
  data = alc[c(2:27,29:31)], method = "glmnet", # Dalc and Walc are dropped as they are the parameters high_use is based on; G1:G3 are dropped as well, since grades are known only after the students have finished their studies (especially the final grade G3)
  tuneGrid=expand.grid(.alpha = .01, .lambda = .1),
  preProc = c("center", "scale"),
  trControl = trainControl(method = "none"),
  sbfControl = sbfControl(functions = caretSBF, method = 'cv', number = 10) 
)
fit
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance:
## 
##    RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD
##  0.3502   0.8844 0.2429 0.04492    0.02063 0.01825
## 
## Using the training set, 15 variables were selected:
##    sexM, age, addressU, Fjobservices, reasonother...
## 
## During resampling, the top 5 selected variables (out of a possible 19):
##    absences (100%), freetime (100%), goout (100%), reasonother (100%), sexM (100%)
## 
## On average, 13.7 variables were selected (min = 11, max = 16)
# Dalc is dropped as it is one of the parameters high_use is based on
bm  <- glm(high_use ~ failures + absences + freetime + age, data = alc, family = "binomial")
bm2 <- glm(high_use ~ failures + absences + freetime, data = alc, family = "binomial")
bm3 <- glm(high_use ~ absences + age + G1+ G2+ G3, data = alc, family = "binomial") # here grades are included
bm4 <- glm(high_use ~ absences + age + G1, data = alc, family = "binomial") # here grades are included
bm5 <- glm(high_use ~ failures + absences + freetime + age + goout, data = alc, family = "binomial")
bm6 <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")
# Stepwise forward and backward selection
step(bm6, direction = "both")
## Start:  AIC=406.4
## high_use ~ failures + absences + goout
## 
##            Df Deviance    AIC
## <none>          398.40 406.40
## - failures  1   402.50 408.50
## - absences  1   410.92 416.92
## - goout     1   440.14 446.14
## 
## Call:  glm(formula = high_use ~ failures + absences + goout, family = "binomial", 
##     data = alc)
## 
## Coefficients:
## (Intercept)     failures     absences        goout  
##    -3.65199      0.41507      0.07547      0.71311  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  378 Residual
## Null Deviance:       465.7 
## Residual Deviance: 398.4     AIC: 406.4
AIC(bm,bm2,bm3,bm4,bm5,bm6) # better fit found without grades
##     df      AIC
## bm   5 439.7969
## bm2  4 439.4485
## bm3  6 451.1008
## bm4  4 447.3286
## bm5  6 409.0303
## bm6  4 406.3965
# bm6 is the best fit, based on parameters preprocessed by centering and scaling, then feature selection using LASSO with 10-fold cross-validation, removing confounding parameters.
# The model was further validated with stepwise forward and backward selection and confirmed by checking the Akaike information criterion.
par(mfrow=c(2,2))
plot(bm6)

# Model validation is a bit trickier with binomial models; you won't see such neat plots unless the number of observations is very high (N > 400). In this case the plots look ok.
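One alternative check I've seen for logistic models is a binned residual summary: group the fitted probabilities into bins and look at the mean residual per bin, which should hover near zero. A sketch on the built-in mtcars data (not the alc data):

```r
# Binned residual check for a logistic model (toy example on mtcars)
m.bin <- glm(am ~ wt, family = binomial, data = mtcars)
p.hat <- fitted(m.bin)
res   <- residuals(m.bin, type = "response")
# quartile bins of the fitted probabilities
bins  <- cut(p.hat, breaks = quantile(p.hat, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE)
tapply(res, bins, mean)  # mean residual per probability bin
```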


# These are loaded only now, since they mask some functions from dplyr
library(MASS)
library(fitdistrplus)
alc_num <- as.numeric(alc$high_use)
fitBinom <- fitdist(data=alc_num, dist="binom", fix.arg=list(size=2), start=list(prob=0.3))
plot(pbinom(sort(alc_num), size=2, prob=fitBinom$estimate), ppoints(alc_num))
abline(0,1)
# This compares the estimated binomial distribution against our response variable.

# compute odds ratios (OR)
OR <- coef(bm6) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(bm6))
cbind(OR, CI)
##                     OR     2.5 %     97.5 %
## (Intercept) 0.02593937 0.0107041 0.05891273
## failures    1.51447465 1.0137625 2.27486120
## absences    1.07838996 1.0341084 1.12889344
## goout       2.04033617 1.6296451 2.58472531
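Reading the table: an odds ratio above 1 raises the odds of high use (e.g. each extra point in goout roughly doubles them). As a minimal illustration of computing odds ratios, here on the built-in mtcars data rather than alc:

```r
# exp() of logistic regression coefficients gives odds ratios
m.or <- glm(am ~ wt, family = binomial, data = mtcars)
exp(coef(m.or))  # OR for wt is below 1: heavier cars have lower odds of a manual gearbox
```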
# predictions
pred.bm6 <- predict(bm6, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = pred.bm6)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = pred.bm6 > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
alc[36:38] %>% tail(10)
##     high_use probability prediction
## 373    FALSE  0.14055409      FALSE
## 374     TRUE  0.36139910      FALSE
## 375    FALSE  0.19198237      FALSE
## 376    FALSE  0.25734107      FALSE
## 377    FALSE  0.15979460      FALSE
## 378    FALSE  0.34330586      FALSE
## 379    FALSE  0.22362093      FALSE
## 380    FALSE  0.06224146      FALSE
## 381     TRUE  0.55365666       TRUE
## 382     TRUE  0.05797935      FALSE
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability, color = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64397906 0.05759162 0.70157068
##    TRUE  0.18062827 0.11780105 0.29842932
##    Sum   0.82460733 0.17539267 1.00000000
# how many are correct
sum(alc[36]==alc[38])/382
## [1] 0.7617801
# 76.178 % are correct
1-sum(alc[36]==alc[38])/382
## [1] 0.2382199
# Training error is 23.8 % (DataCamp's model gives 26 %)

# Here is the DataCamp version of the training error (which I find more complicated, but perhaps easier to follow).
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
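To convince myself the loss function counts what I think it counts, a toy call with hand-checkable numbers (not the course data):

```r
# mean prediction error: fraction of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# predictions 3 and 4 fall on the wrong side of 0.5, so the error rate is 2/4
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.3, 0.6))  # 0.5
```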
# The model is better than the one introduced in DataCamp

# K-fold cross-validation
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
## 
##     aml
## The following object is masked from 'package:lattice':
## 
##     melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = bm6, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486911

Training error is 23.8 % (not validated; DataCamp's is > 26 %). Cross-validated training error is 24.9 %.

Other notes

# I found this very useful in the DataCamp exercises

# produce summary statistics by group with the pipe operator %>%
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade=mean(G3))

This week's exercise was interesting. In the end I tried some new methods I hadn't used before, such as LASSO feature selection. I wonder how the parameters were selected in the DataCamp exercises; was the method in any way similar to the LASSO approach I used? I also tested LASSO without the preprocessing and the results differed: that model performed much worse than the one with preprocessing. I'm curious how preprocessing can affect the LASSO results so much.


Chapter 4. Clustering and classification

I completed all the DataCamp exercises before proceeding to this exercise. I also looked up some details on how to use R Markdown more productively, so hopefully this report will be clearer than the previous ones.

Today we are looking at the Boston data set and its crime rate.

Boston data set information:

| Variable | Description |
| --- | --- |
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "MASS", "tidyverse", "corrplot", "ggplot2", "GGally", "caret", "devtools", "ggthemes", "scales", "plotly")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
#Additionally
#install_github("fawda123/ggord") # Installed from Github for vizualization
library(ggord)
# To empty the memory after the previous exercise
# rm(list=ls())
# Load data
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # to set working directory to source file

# load the data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

All variables are numeric; chas and rad are integers. The data set contains demographic data on Boston, including tax and other information possibly linked to crime rates within the city.

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method=loess, fill="red", color="red", ...) +
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}
g = ggpairs(Boston,columns = c(1:14), lower = list(continuous = my_fn))
g

  1. It seems that indus, nox, rad, tax and lstat correlate positively with crime, while dis, black and medv correlate negatively with crime.
  2. Many of the variables seem to have non-linear relationships (red lines in the picture); however, the variables with high correlation look more linear.
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

A more focused figure of the correlations, since they are hardly visible with this many variables.

# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high", "high"))

# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)

# number of rows in the Boston dataset
n <-nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- Boston_scaled[ind,]

# create test set
test <- Boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <-test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(train)
##        zn              indus               chas          
##  Min.   :-0.4872   Min.   :-1.51549   Min.   :-0.272329  
##  1st Qu.:-0.4872   1st Qu.:-0.86902   1st Qu.:-0.272329  
##  Median :-0.4872   Median :-0.18028   Median :-0.272329  
##  Mean   : 0.0033   Mean   : 0.03352   Mean   :-0.009206  
##  3rd Qu.:-0.3532   3rd Qu.: 1.01499   3rd Qu.:-0.272329  
##  Max.   : 3.8005   Max.   : 2.42017   Max.   : 3.664771  
##       nox                rm                age          
##  Min.   :-1.4644   Min.   :-3.44659   Min.   :-2.33313  
##  1st Qu.:-0.8452   1st Qu.:-0.58052   1st Qu.:-0.78067  
##  Median :-0.1441   Median :-0.12401   Median : 0.38812  
##  Mean   : 0.0514   Mean   :-0.01551   Mean   : 0.04799  
##  3rd Qu.: 0.7966   3rd Qu.: 0.49866   3rd Qu.: 0.91834  
##  Max.   : 2.7296   Max.   : 3.47325   Max.   : 1.11639  
##       dis                rad                tax          
##  Min.   :-1.26582   Min.   :-0.98187   Min.   :-1.31269  
##  1st Qu.:-0.83538   1st Qu.:-0.63733   1st Qu.:-0.75495  
##  Median :-0.33209   Median :-0.52248   Median :-0.42268  
##  Mean   :-0.03155   Mean   : 0.05459   Mean   : 0.06195  
##  3rd Qu.: 0.61991   3rd Qu.: 1.65960   3rd Qu.: 1.52941  
##  Max.   : 3.95660   Max.   : 1.65960   Max.   : 1.79642  
##     ptratio             black             lstat               medv        
##  Min.   :-2.70470   Min.   :-3.9033   Min.   :-1.52961   Min.   :-1.9063  
##  1st Qu.:-0.48756   1st Qu.: 0.2017   1st Qu.:-0.74682   1st Qu.:-0.6913  
##  Median : 0.25149   Median : 0.3837   Median :-0.13696   Median :-0.1721  
##  Mean   : 0.01459   Mean   :-0.0200   Mean   : 0.04816   Mean   :-0.0273  
##  3rd Qu.: 0.80578   3rd Qu.: 0.4331   3rd Qu.: 0.63603   3rd Qu.: 0.2683  
##  Max.   : 1.63721   Max.   : 0.4406   Max.   : 3.54526   Max.   : 2.9865  
##       crime    
##  low     : 98  
##  med_low : 89  
##  med_high:105  
##  high    :112  
##                
## 

Now we have generated the train and test data sets with the new categorical variable "crime", which is based on the quantiles of the original numeric variable.
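The quantile-plus-cut trick is worth remembering; a hand-checkable toy example (not the Boston data):

```r
# split a numeric vector into quartile categories
x <- 1:8
bins <- quantile(x)  # min, Q1, median, Q3, max
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # two observations fall in each quartile
```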

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2202970 0.2599010 0.2772277 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       1.05611797 -0.9539497 -0.15180559 -0.8847310  0.46084014
## med_low  -0.08617034 -0.2158360 -0.00690658 -0.5553086 -0.09722015
## med_high -0.38025152  0.1175536  0.17762524  0.3765069  0.02939201
## high     -0.48724019  1.0169208 -0.06141298  1.0478378 -0.40947736
##                 age        dis        rad        tax     ptratio
## low      -0.8542196  0.9609043 -0.7029578 -0.7323671 -0.43618146
## med_low  -0.3488940  0.3223541 -0.5611968 -0.4396129 -0.08896864
## med_high  0.3973215 -0.3679215 -0.4076377 -0.3219809 -0.29663576
## high      0.8253033 -0.8658225  1.6401200  1.5154800  0.78309558
##                black       lstat         medv
## low       0.37524015 -0.77999636  0.524660469
## med_low   0.34304712 -0.11341692 -0.008331595
## med_high  0.09939918  0.03604687  0.143270083
## high     -0.76627702  0.91253805 -0.685264527
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.131659777  0.708316409 -0.78209681
## indus   -0.050784099 -0.313038607  0.64354579
## chas    -0.008044673 -0.051096072  0.10481470
## nox      0.418782526 -0.541165008 -1.47846665
## rm       0.012765542 -0.001436898  0.04122097
## age      0.303432313 -0.230379508 -0.40368805
## dis     -0.104954903 -0.068615211  0.02136377
## rad      3.192497180  0.666766947 -0.04642300
## tax     -0.014156153  0.236978820  0.50168061
## ptratio  0.148347587  0.113054637 -0.33483539
## black   -0.114889332 -0.001692874  0.16208788
## lstat    0.155073861 -0.202889925  0.40692745
## medv     0.059628404 -0.310170031 -0.30539105
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9524 0.0346 0.0130
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g  + theme_tufte()  + geom_rangeframe()

  1. The points are separated mainly along the 1st linear discriminant, with the high group pulled apart by the rad variable.
  2. From the LDA summary we can see that the first linear discriminant explains 95.24 % of the between-group variance in the Boston dataset.
  3. The rad variable (index of accessibility to radial highways) also has a relatively large coefficient on the first discriminant, so it proves useful in the classification; rad is positively correlated with crime.
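The “proportion of trace” figures can be recomputed from the singular values stored in the fitted object. A minimal sketch on the built-in iris data (the Boston fit is not reproduced here):

```r
library(MASS) # for lda()

fit <- lda(Species ~ ., data = iris)

# proportion of trace = squared singular values, normalised to sum to 1
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```

These are the same numbers that `print.lda` reports under “Proportion of trace”.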
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18      11        0    0
##   med_low    7      15       15    0
##   med_high   0       3       17    1
##   high       0       0        0   15

In the high-crime quartile we predicted every case correctly, but the other classes did not fare as well. Overall, though, LDA seems able to predict our cases with fairly good accuracy.
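The overall accuracy implied by the cross-tabulation can be read off its diagonal; here the printed counts are re-entered by hand:

```r
# confusion matrix copied from the cross-tabulation above
conf <- matrix(c(18, 11,  0,  0,
                  7, 15, 15,  0,
                  0,  3, 17,  1,
                  0,  0,  0, 15),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# share of test cases on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)
```

So about 64 % of the test observations land in the correct quartile, against a 25 % baseline for random guessing over four balanced classes.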

# k-means clustering
BostonS <- scale(Boston) # standardize variables

# total within-cluster sum of squares for k = 1..15 (elbow method)
wss <- (nrow(BostonS) - 1) * sum(apply(BostonS, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS, centers = i)$withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")

km <- kmeans(BostonS, centers = 4)

# plot the Boston dataset with clusters
pairs(BostonS, col = km$cluster)
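Because `kmeans()` starts from random centers, the elbow curve can wiggle between runs. A reproducible sketch of the same elbow method on the built-in iris data, where `set.seed()` and `nstart` are the additions:

```r
set.seed(123) # kmeans() uses random starting centers

X <- scale(iris[, 1:4]) # standardize the numeric columns

# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within groups sum of squares")
```

`nstart = 20` reruns each clustering from 20 random starts and keeps the best, which smooths out the occasional bad local optimum.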

We see results similar to those from the LDA.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication: project the predictors onto the linear discriminants
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# interactive 3D scatter plot (plot_ly() comes from the plotly package)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers')

Now we can see the 3D picture of the clusters!
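The projection step is plain conformable matrix multiplication: a 404 × 13 predictor matrix times the 13 × 3 scaling matrix yields one row of (LD1, LD2, LD3) scores per observation. A toy check with random matrices of the same shapes:

```r
set.seed(7)
predictors <- matrix(rnorm(404 * 13), nrow = 404, ncol = 13) # stands in for the train predictors
scaling    <- matrix(rnorm(13 * 3),   nrow = 13,  ncol = 3)  # stands in for lda.fit$scaling

scores <- predictors %*% scaling
dim(scores) # one row of 3 discriminant scores per observation
```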


Chapter 5. Dimensionality reduction techniques

I completed all the DataCamp exercises before proceeding to this exercise. Today we foray into the mystical world of GNI, and then it’s time for tea.

I found a tip for making the R Markdown output neater. It is a similar solution to the one I used: I stored the CSS in a separate file, “styles.css”, which knitr loads. Look at the index file if interested.

First the GNI.

GNI data set information:

| Variable | Description |
| --- | --- |
| Edu2.FM | Secondary education female/male index |
| Labo.FM | Female labour participation rate |
| Life.Exp | Life expectancy at birth |
| Edu.Exp | Educational expectancy |
| GNI | Gross national income per capita |
| Mat.Mor | Maternal mortality |
| Ado.Birth | Adolescent birth rate |
| Parli.F | Parliamentary representation of females |
require("easypackages") # for loading and installing many packages at one time
packages_to_load <- c("broom", "dplyr", "tidyverse", "corrplot", "ggplot2", "GGally", "devtools", "ggthemes", "stringr", "FactoMineR", "factoextra")
packages(packages_to_load, prompt = TRUE) # lock'n'load install/load
human <- read.table("data/humaf.csv", sep  =",", header = T, row.names = 1)
glimpse(human)
## Observations: 155
## Variables: 8
## $ Edu2.FM   <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ Labo.FM   <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ Life.Exp  <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ Edu.Exp   <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ GNI       <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 529...
## $ Mat.Mor   <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ Parli.F   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
ggpairs(
  human, 1:8,
  lower = list(combo = wrap("facethist", binwidth = 0.5
)))

ggduo(
  human, 5, c(1:4,6:8),
  types = list(continuous = "smooth_loess", mapping = aes(color = GNI)),
  title = "GNI constituents loess smooth",
  xlab = "",
  ylab = ""
)

  1. The data seem non-normal, and all variables have a fairly strong correlation with GNI except Labo.FM.
  2. Many of the variables seem to have non-linear, perhaps logarithmic, relationships (blue lines in the picture).
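The “perhaps logarithmic” guess can be checked directly: if an outcome is linear in log(x), its correlation improves after the transform. A toy sketch with simulated, GNI-like (log-normal) data:

```r
set.seed(42)
gni  <- exp(rnorm(200, mean = 9, sd = 1))     # skewed, GNI-like values
life <- 0.5 * log(gni) + rnorm(200, sd = 0.2) # outcome linear in log(gni)

cor(life, gni)      # weaker on the raw scale
cor(life, log(gni)) # stronger after the log transform
```

This is the same reasoning one would apply before deciding to log-transform GNI in a regression.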
human_std <- scale(human) %>% as.data.frame()
ggpairs(
  human_std, 1:8,
  lower = list(combo = wrap("facethist", binwidth = 0.5
  )))

ggduo(
  human_std, 5, c(1:4,6:8),
  types = list(continuous = "smooth_loess", mapping = aes(color = GNI)),
  title = "GNI constituents loess smooth",
  xlab = "",
  ylab = ""
)

The scaled dataset looks very similar; however, some extreme values are now drawn closer together.

# Running the principal component analysis
pca_human       <- prcomp(human)
pca_human_std   <- prcomp(human_std)

# comparison summaries
s_pca_human     <- summary(pca_human)
s_pca_human_std <- summary(pca_human_std)
s_pca_human
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
s_pca_human_std
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Generating labels
pca_pr      <- round(100*s_pca_human$importance[2, ], digits = 1)
pca_pr_std <- round(100*s_pca_human_std$importance[2, ], digits = 1)
pca_pr_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pca_pr_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# Plotting the two biplots

biplot(
  pca_human, cex = c(0.8, 1),
  col = c("grey40", "deeppink2"),
  xlab = pca_pr_lab[1],
  ylab = pca_pr_lab[2],
  main= "Principal component analysis of GNI components (unstandardized)"
)

biplot(
  pca_human_std,
  cex = c(0.8, 1),
  col = c("black", "green2"),
  xlab = pca_pr_lab_std[1],
  ylab = pca_pr_lab_std[2],
  main= "Principal component analysis of GNI components (standardized)"
)

The unscaled results are unusable: nearly all of the variation is captured by the first component, because the uneven scales let variables with large numeric ranges (such as GNI) dominate.
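The scale-dominance effect is easy to reproduce: give one variable a huge variance and unstandardized PCA loads essentially everything on PC1. A minimal sketch with simulated data:

```r
set.seed(1)
toy <- data.frame(big   = rnorm(100, sd = 1000), # e.g. GNI, in dollars
                  small = rnorm(100))            # e.g. an index near 0..1

# proportion of variance on PC1, unscaled vs scaled
unscaled <- summary(prcomp(toy))$importance[2, 1]
scaled   <- summary(prcomp(toy, scale. = TRUE))$importance[2, 1]

round(c(unscaled = unscaled, scaled = scaled), 3)
```

With the raw data PC1 explains essentially everything; after `scale. = TRUE` the two independent variables split the variance roughly evenly.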

GNI data, standardized PCA loadings for the 1st and 2nd component:

| Variable | Description | PC1 | PC2 |
| --- | --- | --- | --- |
| Edu2.FM | Secondary education female/male index | -0.35664370 | 0.03796058 |
| Labo.FM | Female labour participation rate | 0.05457785 | 0.72432726 |
| Life.Exp | Life expectancy at birth | -0.44372240 | -0.02530473 |
| Edu.Exp | Educational expectancy | -0.42766720 | 0.13940571 |
| GNI | Gross national income per capita | -0.35048295 | 0.05060876 |
| Mat.Mor | Maternal mortality | 0.43697098 | 0.14508727 |
| Ado.Birth | Adolescent birth rate | 0.41126010 | 0.07708468 |
| Parli.F | Parliamentary representation of females | -0.08438558 | 0.65136866 |

PCA splits the data into eight components, of which the first three explain 79 % of the variance and the first two 69 %. The first (horizontal) component is driven on the negative side by education, life expectancy and GNI, and on the positive side by maternal mortality and the adolescent birth rate. The second component is driven by parliamentary representation of females and the labour participation ratio.

# the tea dataset ships with the FactoMineR package
data(tea)

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
tea_time %>% summary()
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
#str(tea_time)

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible = c("ind"), habillage = "quali")

plot(mca, habillage = "quali")

# Scree plot: variance explained by each dimension
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 50))

# Strongest variable on dimension 1: "how"
fviz_mca_ind(mca, habillage = "how",
             addEllipses = TRUE, repel = TRUE)

# Strongest variable on dimension 2: "where"
fviz_mca_ind(mca, habillage = "where",
             addEllipses = TRUE, repel = TRUE)

fviz_mca_var(mca, repel = TRUE)

  1. Here none of the variables has explanatory power as strong as in the previous example.
  2. The first four dimensions explain roughly 50 % of the variance.
  3. The “how” and “where” variables dominate the 1st and 2nd dimensions, but across the six variables the separation of the individuals is not good.
  4. Basically, the individuals in the data set are best separated into groups by where they buy their tea and whether they drink it from tea bags or unpackaged.
  5. Perhaps the other notable characteristic is whether you like green tea or Earl Grey, since these seemed to differ in the dataset (3rd dimension).
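The roughly-50-% claim for four dimensions can be verified from the eigenvalues printed in the MCA summary (copied by hand here):

```r
# eigenvalues of the 11 MCA dimensions, from the summary above
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)

# cumulative percentage of variance
cum_pct <- cumsum(eig) / sum(eig) * 100
round(cum_pct[4], 1) # roughly 51.7 %
```

This matches the “Cumulative % of var.” row of the summary (51.768 % at Dim.4), up to the rounding of the copied eigenvalues.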